The dataset that I am using in this project was found on Kaggle, the well-known Machine Learning Competition website. Click here for a full description of the dataset, or read the description file.
I worked in the automotive industry for 12 years and I remain a devoted pistonhead, so getting a better understanding of the used car market was very appealing.
This project focuses on the exploratory data analysis phase of the dataset. In particular, I will try to detect associations between variables, especially against price. The end-goal of such a project would be to build a price-prediction model for vehicles sold by eBay users.
The dataset is well structured but there are some free text fields and many missing values. Moreover, some of the data is in German and needs to be translated. Most of the translations are straightforward, and Google Translate comes to the rescue where required! The “name” column is problematic: It is free text which causes all sorts of issues, and although a German NLP engineer could perhaps find interesting information in it, I chose to simply drop it.
## dateCrawled seller offerType price abtest vehicleType
## 1: 2016-03-24 11:52:17 privat Angebot 480 test NA
## 2: 2016-03-24 10:58:45 privat Angebot 18300 test coupe
## 3: 2016-03-14 12:52:21 privat Angebot 9800 test suv
## 4: 2016-03-17 16:54:04 privat Angebot 1500 test kleinwagen
## 5: 2016-03-31 17:25:20 privat Angebot 3600 test kleinwagen
## 6: 2016-04-04 17:36:23 privat Angebot 650 test limousine
## yearOfRegistration gearbox powerPS model kilometer
## 1: 1993 manuell 0 golf 150000
## 2: 2011 manuell 190 NA 125000
## 3: 2004 automatik 163 grand 125000
## 4: 2001 manuell 75 golf 150000
## 5: 2008 manuell 69 fabia 90000
## 6: 1995 manuell 102 3er 150000
## monthOfRegistration fuelType brand notRepairedDamage dateCreated
## 1: 0 benzin volkswagen NA 2016-03-24
## 2: 5 diesel audi ja 2016-03-24
## 3: 8 diesel jeep NA 2016-03-14
## 4: 6 benzin volkswagen nein 2016-03-17
## 5: 7 diesel skoda nein 2016-03-31
## 6: 10 benzin bmw ja 2016-04-04
## nrOfPictures postalCode lastSeen
## 1: 0 70435 2016-04-07 03:16:57
## 2: 0 66954 2016-04-07 01:46:50
## 3: 0 90480 2016-04-05 12:47:46
## 4: 0 91074 2016-03-17 17:40:17
## 5: 0 60437 2016-04-06 10:17:21
## 6: 0 33775 2016-04-06 19:17:07
## Classes 'data.table' and 'data.frame': 371824 obs. of 19 variables:
## $ dateCrawled : POSIXct, format: "2016-03-24 11:52:17" "2016-03-24 10:58:45" ...
## $ seller : Factor w/ 2 levels "gewerblich","privat": 2 2 2 2 2 2 2 2 2 2 ...
## $ offerType : Factor w/ 2 levels "Angebot","Gesuch": 1 1 1 1 1 1 1 1 1 1 ...
## $ price : int 480 18300 9800 1500 3600 650 2200 0 14500 999 ...
## $ abtest : Factor w/ 2 levels "control","test": 2 2 2 2 2 2 2 2 1 2 ...
## $ vehicleType : Factor w/ 8 levels "andere","bus",..: NA 4 8 5 5 7 3 7 2 5 ...
## $ yearOfRegistration : int 1993 2011 2004 2001 2008 1995 2004 1980 2014 1998 ...
## $ gearbox : Factor w/ 2 levels "automatik","manuell": 2 2 1 2 2 2 2 2 2 2 ...
## $ powerPS : int 0 190 163 75 69 102 109 50 125 101 ...
## $ model : Factor w/ 251 levels "100","145","147",..: 119 NA 120 119 104 12 9 41 62 119 ...
## $ kilometer : int 150000 125000 125000 150000 90000 150000 150000 40000 30000 150000 ...
## $ monthOfRegistration: int 0 5 8 6 7 10 8 7 8 0 ...
## $ fuelType : Factor w/ 7 levels "andere","benzin",..: 2 4 4 2 4 2 2 2 2 NA ...
## $ brand : Factor w/ 40 levels "alfa_romeo","audi",..: 39 2 15 39 32 3 26 39 11 39 ...
## $ notRepairedDamage : Factor w/ 2 levels "ja","nein": NA 1 NA 2 2 1 2 2 NA NA ...
## $ dateCreated : POSIXct, format: "2016-03-24" "2016-03-24" ...
## $ nrOfPictures : int 0 0 0 0 0 0 0 0 0 0 ...
## $ postalCode : int 70435 66954 90480 91074 60437 33775 67112 19348 94505 27472 ...
## $ lastSeen : POSIXct, format: "2016-04-07 03:16:57" "2016-04-07 01:46:50" ...
## - attr(*, ".internal.selfref")=<externalptr>
The column abtest seems to be internal to E-Bay, probably the control or test groups for some internal A/B testing. I don’t believe we will need it. The column nrOfPictures only contains zeros, probably a data collection issue. We don’t need it either. I will also drop the postalCode column, because I don’t intend to cross-reference the data with a postal map of Germany although that could be another interesting project.
The values in the different factors are fairly straightforward. I translate them into English; at the same time I drop the 12 ads from people looking to purchase a car (offerType == Gesuch), as I don’t have confidence that they would have accurate information about car specifications, nor sensible asking prices. As a result, we no longer need this column.
The seller column contains only 3 professional traders, which is insignificant in comparison to the total number of observations. Therefore I drop the column.
Finally, I also noticed that some zeros should really be NAs: In price, monthOfRegistration, powerPS.
The dateCrawled and dateCreated columns might not be very useful in themselves, but they allow us to calculate how long an ad has been up for on the website, and thus gives us an approximate lower bound for selling time (see below for a discussion on this). By default this value is calculated in minutes, I convert it to days. With this new ad_up_time variable, we no longer need the other date variables.
We now have a clean and usable dataset:
## Classes 'data.table' and 'data.frame': 371809 obs. of 12 variables:
## $ price : int 480 18300 9800 1500 3600 650 2200 NA 14500 999 ...
## $ vehicleType : Factor w/ 8 levels "other","people carrier",..: NA 4 8 5 5 7 3 7 2 5 ...
## $ yearOfRegistration : int 1993 2011 2004 2001 2008 1995 2004 1980 2014 1998 ...
## $ gearbox : Factor w/ 2 levels "automatic","manual": 2 2 1 2 2 2 2 2 2 2 ...
## $ powerPS : int NA 190 163 75 69 102 109 50 125 101 ...
## $ model : Factor w/ 251 levels "100","145","147",..: 119 NA 120 119 104 12 9 41 62 119 ...
## $ kilometer : int 150000 125000 125000 150000 90000 150000 150000 40000 30000 150000 ...
## $ monthOfRegistration: int NA 5 8 6 7 10 8 7 8 NA ...
## $ fuelType : Factor w/ 7 levels "other","petrol",..: 2 4 4 2 4 2 2 2 2 NA ...
## $ brand : Factor w/ 40 levels "alfa_romeo","audi",..: 39 2 15 39 32 3 26 39 11 39 ...
## $ notRepairedDamage : Factor w/ 2 levels "yes","no": NA 1 NA 2 2 1 2 2 NA NA ...
## $ ad_up_time : num 818.53 817 1319.81 1.93 342.17 ...
## - attr(*, ".internal.selfref")=<externalptr>
## - attr(*, "index")= int
Here, I look at each of the variables independently to understand their distribution.
## price vehicleType yearOfRegistration
## Min. :1.000e+00 sedan :95962 Min. :1000
## 1st Qu.:1.250e+03 small car :80095 1st Qu.:1999
## Median :3.000e+03 estate :67625 Median :2003
## Mean :1.780e+04 people carrier:30218 Mean :2005
## 3rd Qu.:7.490e+03 convertible :22914 3rd Qu.:2008
## Max. :2.147e+09 (Other) :37103 Max. :9999
## NA's :10779 NA's :37892
## gearbox powerPS model kilometer
## automatic: 77168 Min. : 1.0 golf : 30085 Min. : 5000
## manual :274424 1st Qu.: 80.0 other : 26422 1st Qu.:125000
## NA's : 20217 Median : 116.0 3er : 20580 Median :150000
## Mean : 129.8 polo : 13105 Mean :125620
## 3rd Qu.: 150.0 corsa : 12584 3rd Qu.:150000
## Max. :20000.0 (Other):248538 Max. :150000
## NA's :40851 NA's : 20495
## monthOfRegistration fuelType brand
## Min. : 1.00 petrol :224032 volkswagen : 79693
## 1st Qu.: 4.00 diesel :107824 bmw : 40298
## Median : 6.00 lpg : 5382 opel : 40165
## Mean : 6.38 cng : 571 mercedes_benz: 35345
## 3rd Qu.: 9.00 hybrid : 280 audi : 32897
## Max. :12.00 (Other): 312 ford : 25591
## NA's :37701 NA's : 33408 (Other) :117820
## notRepairedDamage ad_up_time
## yes : 36308 Min. : 0.0
## no :263384 1st Qu.: 116.6
## NA's: 72117 Median : 349.7
## Mean : 515.3
## 3rd Qu.: 798.4
## Max. :1981.8
##
We have 12 variables which I can now plot individually.
From the summary above, we see that prices go up to over \(€2.10^9\)! This is obviously wrong. While looking at all cars over €100,000 in more detail, I noticed that many of these prices seemed either entered at random or confused with kilometers: I found patterns such as ‘111111’ or ‘12345678’, or mainstream models over €150,000. To try and filter out most of these issues, I assumed that such high-end cars would most likely be coupes, convertibles or SUVs. I also dropped any observation above €200,000, assuming the majority of them would be input errors. I then dropped any row that did not match these criteria and looked at the brands of cars above €75,000:
## [1] chevrolet porsche other mercedes_benz bmw
## [6] volkswagen audi jaguar seat ford
## [11] land_rover opel renault smart nissan
## [16] mitsubishi
## 40 Levels: alfa_romeo audi bmw chevrolet chrysler citroen dacia ... volvo
Some of these brands are not considered premium and it is surprising to find them here. Let’s see the models and prices for the non-premium brands (Volkswagen, Seat, Ford, Opel, Renault, Smart, Nissan, Mitsubishi):
## price vehicleType yearOfRegistration gearbox powerPS model
## 1: 99999 NA 1910 NA NA NA
## 2: 99999 NA 1970 NA NA other
## 3: 123456 NA 2000 manual 75 golf
## 4: 154651 NA 2005 NA NA altea
## 5: 130000 coupe 1968 NA NA mustang
## 6: 100000 coupe 1968 manual 131 other
## 7: 93000 coupe 1971 manual NA other
## 8: 100000 small car 2013 automatic NA other
## 9: 100000 NA 2000 NA NA NA
## 10: 78000 SUV 2015 automatic 340 touareg
## 11: 99999 SUV 2001 manual 114 x_trail
## 12: 99999 NA 1910 NA NA NA
## 13: 130000 NA 2000 NA NA NA
## 14: 120000 coupe 1967 manual 550 mustang
## 15: 99999 NA 2017 manual 204 golf
## 16: 85000 people carrier 1967 manual 44 transporter
## 17: 123456 NA 1985 NA NA golf
## 18: 79499 SUV 2015 automatic 340 touareg
## 19: 99999 small car 2009 automatic 71 fortwo
## kilometer monthOfRegistration fuelType brand notRepairedDamage
## 1: 150000 NA NA volkswagen NA
## 2: 150000 NA NA volkswagen NA
## 3: 150000 7 NA volkswagen NA
## 4: 150000 NA NA seat NA
## 5: 50000 7 petrol ford NA
## 6: 100000 5 petrol opel no
## 7: 30000 2 petrol renault no
## 8: 150000 11 electric smart no
## 9: 150000 NA NA volkswagen NA
## 10: 50000 4 diesel volkswagen no
## 11: 150000 9 diesel nissan no
## 12: 150000 NA NA volkswagen NA
## 13: 150000 NA NA mitsubishi NA
## 14: 100000 9 petrol ford no
## 15: 150000 2 NA volkswagen NA
## 16: 150000 1 petrol volkswagen no
## 17: 125000 NA NA volkswagen NA
## 18: 20000 5 diesel volkswagen no
## 19: 50000 10 petrol smart no
## ad_up_time
## 1: 1341.10625
## 2: 89.24861
## 3: 207.49583
## 4: 12.56250
## 5: 932.87153
## 6: 0.00000
## 7: 664.18125
## 8: 0.00000
## 9: 1791.43472
## 10: 1817.59167
## 11: 1441.67153
## 12: 125.89375
## 13: 1087.41875
## 14: 904.70556
## 15: 802.15972
## 16: 1592.41250
## 17: 854.05069
## 18: 1642.35347
## 19: 744.62014
So we find a variety of models there, some unnamed. The VW Touareg and Ford Mustang seem legitimate. There are also some really old cars, for which a high price might be justified to a collector, but in most cases the model name is not mentioned for these so it is difficult to say whether they are genuinely expensive cars or errors. Since we have more than enough observations overall, I decide to drop them and only retain the VW Touareg and Ford Mustang.
There are also cars below €100, which I assume are also errors or sellers not wanting to filter themselves out of the price selector on the website. I drop these too.
I then plot the variable again with log plus 1 transformation on the x-axis:Red bar indicates mean, blue bar indicates median
With the scale transformation, we have a roughly normal distribution. We notice that some bins have a much higher count than their neighbours, presumably corresponding to round values or “psychological” prices (e.g. €9,900).
This categorical variable has 8 levels and indicates the body style of the car (sedan, coupe, SUV etc.)
This tends to reflect the general West-European market, with a prominence of “family” vehicles and smaller volumes of “niche” products (although a sample of new car registrations would probably show a higher proportion of SUVs considering the rise of this body style in recent years). Also note the large number of NAs – about 20,000. E-Bay could definitely do a better job at encouraging their customers to write their ads properly.
This will basically tell us about the age of the vehicle. From the data summary, we saw that the minimum year is 1,000, which will come as a surprise to most historians. The maximum year is 9,999, which is obviously wrong as we will all be teleporting by then. So we clearly need to tidy up this variable. To keep things simple, I select only vehicles registered since 1960. As the data was collected in 2016, we set that year as our upper bound. Then we plot a histogram.
Red bar indicates mean, blue bar indicates median
Note that the mean and the median are identical. The distribution is close to normal, with the following exceptions:
I did some research and it turns out that 1999, 2000, 2005, 2006 were all among the strongest years for new car registrations in Germany in the last 20 years. As they are also in the heart of the used car market in terms of age, it makes sense that they would translate into these peaks.
As for 2016, the explanation is less obvious, especially as the data was collected in March and April, which is quite ealy in the year. The peak could be due to some listing errors (on purpose or not) where owners enter a date at random or to attract visitors to their ad. It could also be linked to the website’s features when creating an ad (eg. default value in drop-down menu).
But there could also be a number of genuine 2016 cars suddenly arriving on the market. Employees in the automotive industry have often access to cheap car leasing schemes, whereby they can change their vehicle every 6 months or so. More importantly, most manufacturers register large numbers of demonstrators, press units and self-registered cars (new vehicles registered by the manufacturer or its dealers, in order to artificially boost market share and / or create cheaper opportunities to capture some customers over the competition).
We should note from the summary in the beginning that monthOfRegistration contains nearly 38,000 NAs. Again, it is surprising that year seems mandatory (although its value is clearly not controlled) but month is not.
March and June are the strongest month for vehicle registrations. A quick research on the internet confirmed that this is consistent with the car registration seasonality that we observe in Germany (incidently, there is a similar effect in France and the UK).
The gearbox variable can only take two values: manual or automatic.
The European market is primarily a manual transmission market, so no surprises here. Again, there are about 15,000 NAs.
The engine power is measured using the metric PS (1PS = 736 Watts). Again, we know from the summary that there are some nonsensical values in the data. I was prepared to remove anything below 40PS, but then I realised that there are Trabants in the dataset! This venerable left-over from the East-German Communist era is now widely used in Berlin as a rental car for people looking for a different experience of the city. Its 2-stroke engine managed 26PS!. Out of respect for such an antiquity, I decided to set the lower bound at 25PS.
As for the excessively high PS values, it looks like most of them are due to people confusing power output and engine capacity (in cm\(^3\)). I decided to set the limit at 600PS, a more than respectable value.
Red bar indicates mean, blue bar indicates median
The distribution is postively skewed with a long tail (that may contain errors, as we just saw).
There are some prefered values – around 60, 100, 120, 140 for instance. These are values that have become some sort of “market standards”: Most manufacturers will offer engines around these values. It makes it easier for the consumer to compare products. Real, measurable power output of a car is never exactly equal to its rated horsepower due to variance in industrial processes. Differences of around 5% are not uncommon for a same model.
This factor variable contails 251 levels, far too many to plot. But we can select the top 20:
Without surprise, the Volkswagen Golf (the most popular car in Europe) is also number one in the dataset. Note the very large number of vehicles designated as “other” – while some of them are probably models that exist but cannot be selected on the eBay website, it is unlikely that their number would be that high so we have to assume that once again, they are mostly due to human error.
The next car on the list is the BMW 3-Series, which is by no means a cheap car. This fact alone shows that we are indeed working on the German market – the most high-end market in Europe.
… or more accurately, “kilometreage”.
Red bar indicates mean, blue bar indicates median
This histogram clearly shows that mileage is not a free input field. This is surprising, because mileage is one of the most important pieces of information when it comes to used cars, so maximum accuracy would have been desirable. Moreover, when I went to E-Bay myself to try out the used car ad generator, I was able to enter any value. So maybe there is some aggregationg mechanism during data extract or the functionality changed recently?
The second thing that is striking with this chart is the predominance of 150,000km cars. Given the nature of the data, there is no point trying to apply scale transformations to improve the plot. This variable should actually be considered as a categorical variable more than a continuous one. I therefore add a variable in the dataset called km_cat. We will see later which one is more appropriate.
This factor variable also contains many NAs (over 33,000).
“Alternative”" sources of energy are almost negligible in this dataset, which is not surprising considering that over 50% of the vehicles were 12 to 13 years old when this data was extracted.
Petrol is roughly twice as prominent as Diesel.
This is another factor with many levels (40) so we will take the same approach as with model.
The top 5 brands are German. Number 6 is Ford, which in Europe is largely perceived as German as it has its European headquarters in Cologne and many of its European products are actually designed and built in Germany. The next two brands are French, then Fiat is Italian. Seat is Spanish but it is actually part of the VW Group and their cars share almost all their components with VW products.
In other words, German manufacturers are hugely dominant on their home turf.
One issue we will have when looking for associations between variables is that with brand containing 40 levels, plots will be really hard to read (in addition to causing long processing times). We can improve this by grouping brands into categories based on brand perception. We could simply use mean prices to make these distinctions, but we would then create a correlation to price where there isn’t necessarily one. Moreover, brand perception involves a lot more than just price – there is history, perceived quality, marketing etc.
So I chose another approach: Use my domain-knowledge to manually classify the 40 levels into clusters. I didn’t plan on an exact number of clusters beforehand, just something manageable. I then intuitively grouped brands together and came up with 8 clusters which I then named.
Althought this intuitive approach is subjective, I believe it actually adds information to the dataset, unlike the “group by price” method which removes some.
There are large disparities between brand categories in terms of count. The biggest surprise is the predominance of the premium brands, which are supposedly the most expensive. But the analysis of the top 10 brands above explains it: This category contains BMW, Audi and Mercedes which are all among the most common brands in Germany. Clearly, this plot would look very different in most other countries.
The variable notRepairedDamage can only take two values: “yes” or “no”. But it does have NAs – about 72,000, which is twice as many as the number of “yes”. It does not seem like this is a mandatory field (and I could not find it on EBay). As I understand it, it refers to potential unrepaired damage on the vehicle being sold.
This is a composite variable that we created by substracting the dateCreated from the dateCrawled date. It is measured in days. Of course the idea is to look for potential correlations to other variables, especially price. From that point of view, there are important limitations associated with ad_up_time:
Red bar indicates mean, blue car indicates median
Here we have observations up to 2,000 days (nearly 5.5 years). The long tail contains a significant number of observations and there seems to be a lot of variance in the data, so it is difficult to just drop observations over an arbitrary number of days posted.
The other thing I notice is that there seems to be some prevailing values at roughly three months intervals. I am not sure why this is – it might be related to the pricing structure used by eBay…?
In this section, I will examine pairs of variables to look for associations between variables, and then zoom in on some multi-variable combinations that seem particularly interesting.
First, let’s visualize potential variable associations with a plot matrix (to reduce computing time, I use a sample of 10,000 observations):
Even so the plot matrix is very clutered so we will build individual plots for the most promising variable pairs. Eventually I would like to be able to predict prices based on the other variables, so let’s focus on price as one of our variables.
The plot matrix above shows that these two variables have the highest linear correlation so let’s start here. I make a scatterplot of price vs PowerPS with a colour-coding for fuelType (and keeping only the two most popular: petrol and Diesel – upper-case D is intentional as Diesel is the name of the inventor, Rudolf Diesel):
There clearly is a correlation between power and price, which does not seem perfectly linear but rather looks a little like some kind of root or log function. It also seems that for a given power output, Diesel cars are more expensive (not a great surprise considering that modern Diesel powertrains are usually more technologically advanced), but it is hard to tell from this plot.
Taking the log of powerPS definitely helps make the relationship look more linear for petrol cars, but it does not have very convincing results on Diesel cars. This means that when building the linear model, we will need a different set of parameters for each fuel type.
I will separate fuel types, and since we are dealing with technical specifications of vehicles, I will add an extra dimension with transmission type (gearbox):
We see that petrol cars are more spread out in terms of price and power than Diesel cars. Using the log of power, the smoothers are close to linear in the first category but not in the second one. We also notice that automatic cars tend to be both more powerful and more expensive than manual, but this will need to be confirmed later.
## log_price_cor
## 1: 0.5837277
## fuelType gearbox log_price_cor
## 1: diesel manual 0.4795269
## 2: diesel automatic 0.4939476
## 3: petrol manual 0.4917658
## 4: petrol automatic 0.5857408
So when grouped by transmission and fuel types, the correlations are all around the .5 mark, except the Petrol / Automatic combination which is around .6. We see that the associations are similar in shape although for manual Diesel cars, the data points form a much tighter group so it is harder to tell.
So far, we have found that there is a positive correlation of price with power, and perhaps associations to gearbox and fuel type.
Another continuous variable with a significant correlation to price (according to our plot matrix) is yearOfRegistration.
The distribution looks very strange.
yearOfRegistration alone. On the scatter plot, we can see that this line includes cars at just about any price from a couple hundred Euros to maybe €20,000. Strangely, the bulk of cars just older than this seems to be more expensive, which suggests that they are most probably errors (intentional or not) or people just picking the first year available on the drop-down menu when listing their car. If these cars were really 2016 cars, it would be illogical for them to be advertised at cheaper prices than cars 5 years older. This strongly advocates in favour of filtering 2016 cars out of the data.Let’s first split the data in two with a cut-off in 1995 (21 years ago) and remove the 2016 cars from the dataset (there is only 4 months’ worth of data for that year and it seems highly inaccurate). I then plot the new data with a distinction beween vintage and modern cars.
Now I would like to take a look at the correlation values:
## [1] "Vintage cars:"
##
## Pearson's product-moment correlation
##
## data: price and yearOfRegistration
## t = -69.172, df = 28265, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.3904196 -0.3704793
## sample estimates:
## cor
## -0.3804937
## [1] "Modern cars:"
##
## Pearson's product-moment correlation
##
## data: price and yearOfRegistration
## t = 415.11, df = 276410, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6173838 0.6219766
## sample estimates:
## cor
## 0.6196855
The correlation is much weaker for vintage cars than for modern cars because of the greater variance. This variance comes partly from the lower count, but also probably because vintage car prices can vary enormously accoriding to seemingly irrational factors. They no longer compete with each other so market pressure is different. Moreover their value depends on factors such as: rarity, quality of restoration, use of original parts only, authenticity (faithfulness to the exact specifications of the car when it came out of the factory), historical value, part usage, car’s history etc.
Two cars built exactly the same year, 40 years ago, and selling at the same price back then, can nowadays have orders of magnitude between their current values.
In the previous section, I noticed that the variable kilometer behaves more like a categorical variable than a continuous one. However I want to try plotting both it to see which representation works best.
There is an association to log_price that looks roughly linear, except between 5,000 and 20,000km. The clearest feature is that the variance increases with mileage. Let’s compare with a boxplot, this time treating mileage as a categorical variable:
Brown diamonds represent means
Despite kilometer being grouped in bins, I believe the scatter plot is a marginally better representation because the bins are not regularly spaced, so the box plot gives a distorted view.
The 5000km data looks highly suspicious - we could be observing the same effect as with yearOfRegistration == 2016. Let’s check for that in the next set of analyses:
The smoother seems to indicate that, similarly to what we observed on price, kilometers tend to be positively correlated to the age of the car up to 15-20 years old, then negatively correlated after that. This could also be linked to the fact that vintage cars don’t tend to run as much, as they are rarely the household’s main car and are often of questionable reliability. However considering the dispersion in the early years of the dataset, I am not sure I should lend much credit to this observation. Modern cars, on the other hand, generally have a higher mileage as they get older, as common sense would have it.
We also notice a group of observations at 5,000 kilometers between 1995 and 2005 approximately, that do not seem to follow the general distribution. Since that period is precisely the one with the highest density of observations in the dataset, I suspect these cars actually have a much higher mileage and that the data is wrong. These are probably the same suspicious observations that we noticed just before, for which kilometers are most likely severely under-valued.
From the plots below, it doesn’t look like ad_up_timeis going to be very informative. The matrix plot also reports that there is virtually no correlation to any other continuous variable.
If anything, the very tenuous trends we observe are rather counter-intuitive: cars with a low ad_up_time seem to be generally cheaper and to have higher mileage than the rest.
In this section, we are going to look at combinations of discrete variables, with a stronger focus on price as this is the variable that I would like to explain.
Brown diamonds represent means
This boxplot confirms that prices are generally higher for auto transmissions than for manuals, and also higher for Diesel cars than petrol. However there is more dispersion for auto and Diesel than for manual and petrol, probably due to their lower count in the data. We also notice that there are many outliers. Here I used a linear scale for price but cut off at €50,000, and there are many more outliers above the cut-off point that cannot be seen on the plotting area.
Brown diamonds represent means
As expected, different vehicle types have different price distributions, most of them approaching normal when viewed on a logarithmic scale. The highest mean and median prices are found with SUVs, followed by convertibles, coupés and people carriers (minivans in the US). The variances of these distributions are quite large, coupés in particular. However this looks like a good contributor in explaining price differences.
Since I took the time to manually classify brands by perceived “premiumness”, let’s have a look at potential associations with price:
Brown diamonds represent means
In general, we observe a fairly logical pattern with higher prices for more premium brands. The only surpise is that the “budget” and “budget_plus” categories seem almost as expensive overall as the “premium_minus” brands. I am not sure why, maybe this has to do with the fact that some of the brands that make up these two categories are fairly recent and therefore have a younger population?
Brown diamonds represent means
## [1] "Median year of registration by brand category:"
## brand_cat V1
## 1: premium 2004
## 2: premium_minus 2004
## 3: mid_plus 2003
## 4: budget_plus 2007
## 5: mid_range 2003
## 6: mid_minus 2003
## 7: budget 2008
## 8: other 1999
It seems that my intuition was correct. These two categories are younger than the rest (with a 4-5 year difference in the medians) which would explain at least part of the observation we made previously.
So cars with unrepaired damage are much cheaper on average than cars in good condition. The log10 scale is slightly deceiving here, but in reality there is a 1-to-3 to 1-to-4 difference.
However we should note that the notion of unrepaired damage is somewhat vague. No one expect 10- or 15-year-old cars to be in immaculate condition: They will always have some scratches and bumps. However these are unlikely to drop the car’s price by a factor 3 so I conclude that to most sellers, unrepaired damage means serious damage, potentially preventing the vehicule from operating normally. Is this definition presented in the eBay guidelines? Or is it just an implicit understanding from sellers? I tried to navigate the eBay website for this information but was unable to find it.
Based on the knowledge that we gained on the previous phases of the project, I would like to attempt a linear regression to predict prices based on the most useful variables. To clean up the data further, I decided to remove all observations that are in the 5,000km group and were registered between 1990 and 2010. I decided to select the following variables:
log_pricepowerPS in interaction with fuelType, following our observations on the shape of the correlationpowerPS in interaction with fuelTypeyearOfRegistration in interaction with collector_statuskilometer in interaction with collector_statusvehicleTypebrand_catnotRepairedDamage in interaction with everything else. The reason for this is that we saw that this particular variable is associated with a 3- or 4-fold price difference, therefore for all intents and purposes it defines two separate markets (much like collector_status)##
## Call:
## lm(formula = log_price ~ (powerPS:fuelType + I(log(powerPS)):fuelType +
## gearbox + yearOfRegistration:collector_status + kilometer:collector_status +
## vehicleType + brand_cat):notRepairedDamage, data = cars_no_nas)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.33514 -0.10928 0.01774 0.12433 2.98695
##
## Coefficients: (1 not defined because of singularities)
## Estimate
## (Intercept) -7.030e+01
## gearboxautomatic:notRepairedDamageyes -2.374e+01
## gearboxmanual:notRepairedDamageyes -2.374e+01
## gearboxautomatic:notRepairedDamageno 7.917e-03
## gearboxmanual:notRepairedDamageno NA
## vehicleTypepeople carrier:notRepairedDamageyes 6.816e-02
## vehicleTypeconvertible:notRepairedDamageyes 1.853e-01
## vehicleTypecoupe:notRepairedDamageyes 6.781e-02
## vehicleTypesmall car:notRepairedDamageyes -4.818e-02
## vehicleTypeestate:notRepairedDamageyes -8.592e-02
## vehicleTypesedan:notRepairedDamageyes -1.424e-02
## vehicleTypeSUV:notRepairedDamageyes 1.738e-01
## vehicleTypepeople carrier:notRepairedDamageno 6.198e-02
## vehicleTypeconvertible:notRepairedDamageno 1.519e-01
## vehicleTypecoupe:notRepairedDamageno 5.488e-02
## vehicleTypesmall car:notRepairedDamageno -2.025e-02
## vehicleTypeestate:notRepairedDamageno -6.427e-02
## vehicleTypesedan:notRepairedDamageno -4.017e-02
## vehicleTypeSUV:notRepairedDamageno 9.883e-02
## brand_cat.L:notRepairedDamageyes 1.036e-01
## brand_cat.Q:notRepairedDamageyes -5.565e-02
## brand_cat.C:notRepairedDamageyes -1.097e-01
## brand_cat^4:notRepairedDamageyes -7.563e-02
## brand_cat^5:notRepairedDamageyes 7.784e-03
## brand_cat^6:notRepairedDamageyes -8.997e-02
## brand_cat^7:notRepairedDamageyes -9.955e-02
## brand_cat.L:notRepairedDamageno 1.093e-01
## brand_cat.Q:notRepairedDamageno -5.271e-02
## brand_cat.C:notRepairedDamageno -9.334e-02
## brand_cat^4:notRepairedDamageno -8.480e-02
## brand_cat^5:notRepairedDamageno 3.428e-02
## brand_cat^6:notRepairedDamageno -4.776e-02
## brand_cat^7:notRepairedDamageno -6.072e-02
## powerPS:fuelTypeother:notRepairedDamageyes -9.668e-04
## powerPS:fuelTypepetrol:notRepairedDamageyes 1.553e-03
## powerPS:fuelTypecng:notRepairedDamageyes -1.649e-03
## powerPS:fuelTypediesel:notRepairedDamageyes 1.296e-03
## powerPS:fuelTypeelectric:notRepairedDamageyes 3.791e-01
## powerPS:fuelTypehybrid:notRepairedDamageyes 4.343e-04
## powerPS:fuelTypelpg:notRepairedDamageyes 1.455e-03
## powerPS:fuelTypeother:notRepairedDamageno 4.973e-04
## powerPS:fuelTypepetrol:notRepairedDamageno 1.192e-03
## powerPS:fuelTypecng:notRepairedDamageno 1.318e-03
## powerPS:fuelTypediesel:notRepairedDamageno 7.701e-04
## powerPS:fuelTypeelectric:notRepairedDamageno -4.869e-03
## powerPS:fuelTypehybrid:notRepairedDamageno -3.127e-04
## powerPS:fuelTypelpg:notRepairedDamageno 6.192e-04
## fuelTypeother:I(log(powerPS)):notRepairedDamageyes 2.419e-01
## fuelTypepetrol:I(log(powerPS)):notRepairedDamageyes 1.450e-01
## fuelTypecng:I(log(powerPS)):notRepairedDamageyes 2.275e-01
## fuelTypediesel:I(log(powerPS)):notRepairedDamageyes 1.836e-01
## fuelTypeelectric:I(log(powerPS)):notRepairedDamageyes -5.754e+00
## fuelTypehybrid:I(log(powerPS)):notRepairedDamageyes 2.108e-01
## fuelTypelpg:I(log(powerPS)):notRepairedDamageyes 1.530e-01
## fuelTypeother:I(log(powerPS)):notRepairedDamageno 2.691e-01
## fuelTypepetrol:I(log(powerPS)):notRepairedDamageno 2.499e-01
## fuelTypecng:I(log(powerPS)):notRepairedDamageno 2.608e-01
## fuelTypediesel:I(log(powerPS)):notRepairedDamageno 2.874e-01
## fuelTypeelectric:I(log(powerPS)):notRepairedDamageno 4.177e-01
## fuelTypehybrid:I(log(powerPS)):notRepairedDamageno 3.355e-01
## fuelTypelpg:I(log(powerPS)):notRepairedDamageno 2.706e-01
## yearOfRegistration:collector_statusmodern:notRepairedDamageyes 4.826e-02
## yearOfRegistration:collector_statusvintage:notRepairedDamageyes 4.868e-02
## yearOfRegistration:collector_statusmodern:notRepairedDamageno 3.635e-02
## yearOfRegistration:collector_statusvintage:notRepairedDamageno 3.687e-02
## collector_statusmodern:kilometer:notRepairedDamageyes -2.584e-06
## collector_statusvintage:kilometer:notRepairedDamageyes -6.944e-06
## collector_statusmodern:kilometer:notRepairedDamageno -2.856e-06
## collector_statusvintage:kilometer:notRepairedDamageno -8.370e-06
## Std. Error
## (Intercept) 2.874e-01
## gearboxautomatic:notRepairedDamageyes 9.520e-01
## gearboxmanual:notRepairedDamageyes 9.522e-01
## gearboxautomatic:notRepairedDamageno 1.399e-03
## gearboxmanual:notRepairedDamageno NA
## vehicleTypepeople carrier:notRepairedDamageyes 1.564e-02
## vehicleTypeconvertible:notRepairedDamageyes 1.696e-02
## vehicleTypecoupe:notRepairedDamageyes 1.661e-02
## vehicleTypesmall car:notRepairedDamageyes 1.535e-02
## vehicleTypeestate:notRepairedDamageyes 1.522e-02
## vehicleTypesedan:notRepairedDamageyes 1.515e-02
## vehicleTypeSUV:notRepairedDamageyes 1.767e-02
## vehicleTypepeople carrier:notRepairedDamageno 7.317e-03
## vehicleTypeconvertible:notRepairedDamageno 7.394e-03
## vehicleTypecoupe:notRepairedDamageno 7.490e-03
## vehicleTypesmall car:notRepairedDamageno 7.280e-03
## vehicleTypeestate:notRepairedDamageno 7.231e-03
## vehicleTypesedan:notRepairedDamageno 7.203e-03
## vehicleTypeSUV:notRepairedDamageno 7.545e-03
## brand_cat.L:notRepairedDamageyes 2.728e-02
## brand_cat.Q:notRepairedDamageyes 2.728e-02
## brand_cat.C:notRepairedDamageyes 2.229e-02
## brand_cat^4:notRepairedDamageyes 1.526e-02
## brand_cat^5:notRepairedDamageyes 9.606e-03
## brand_cat^6:notRepairedDamageyes 6.831e-03
## brand_cat^7:notRepairedDamageyes 4.973e-03
## brand_cat.L:notRepairedDamageno 9.442e-03
## brand_cat.Q:notRepairedDamageno 9.482e-03
## brand_cat.C:notRepairedDamageno 7.692e-03
## brand_cat^4:notRepairedDamageno 5.222e-03
## brand_cat^5:notRepairedDamageno 3.205e-03
## brand_cat^6:notRepairedDamageno 2.228e-03
## brand_cat^7:notRepairedDamageno 1.654e-03
## powerPS:fuelTypeother:notRepairedDamageyes 1.301e-03
## powerPS:fuelTypepetrol:notRepairedDamageyes 1.061e-04
## powerPS:fuelTypecng:notRepairedDamageyes 1.909e-03
## powerPS:fuelTypediesel:notRepairedDamageyes 1.352e-04
## powerPS:fuelTypeelectric:notRepairedDamageyes 2.203e-01
## powerPS:fuelTypehybrid:notRepairedDamageyes 1.321e-03
## powerPS:fuelTypelpg:notRepairedDamageyes 2.301e-04
## powerPS:fuelTypeother:notRepairedDamageno 9.492e-04
## powerPS:fuelTypepetrol:notRepairedDamageno 2.977e-05
## powerPS:fuelTypecng:notRepairedDamageno 5.776e-04
## powerPS:fuelTypediesel:notRepairedDamageno 3.821e-05
## powerPS:fuelTypeelectric:notRepairedDamageno 8.594e-04
## powerPS:fuelTypehybrid:notRepairedDamageno 2.909e-04
## powerPS:fuelTypelpg:notRepairedDamageno 7.709e-05
## fuelTypeother:I(log(powerPS)):notRepairedDamageyes 5.417e-02
## fuelTypepetrol:I(log(powerPS)):notRepairedDamageyes 1.462e-02
## fuelTypecng:I(log(powerPS)):notRepairedDamageyes 4.638e-02
## fuelTypediesel:I(log(powerPS)):notRepairedDamageyes 1.488e-02
## fuelTypeelectric:I(log(powerPS)):notRepairedDamageyes 3.471e+00
## fuelTypehybrid:I(log(powerPS)):notRepairedDamageyes 3.945e-02
## fuelTypelpg:I(log(powerPS)):notRepairedDamageyes 1.586e-02
## fuelTypeother:I(log(powerPS)):notRepairedDamageno 2.495e-02
## fuelTypepetrol:I(log(powerPS)):notRepairedDamageno 4.656e-03
## fuelTypecng:I(log(powerPS)):notRepairedDamageno 1.423e-02
## fuelTypediesel:I(log(powerPS)):notRepairedDamageno 4.737e-03
## fuelTypeelectric:I(log(powerPS)):notRepairedDamageno 2.460e-02
## fuelTypehybrid:I(log(powerPS)):notRepairedDamageno 9.731e-03
## fuelTypelpg:I(log(powerPS)):notRepairedDamageno 5.209e-03
## yearOfRegistration:collector_statusmodern:notRepairedDamageyes 4.540e-04
## yearOfRegistration:collector_statusvintage:notRepairedDamageyes 4.602e-04
## yearOfRegistration:collector_statusmodern:notRepairedDamageno 1.439e-04
## yearOfRegistration:collector_statusvintage:notRepairedDamageno 1.462e-04
## collector_statusmodern:kilometer:notRepairedDamageyes 6.762e-08
## collector_statusvintage:kilometer:notRepairedDamageyes 1.661e-07
## collector_statusmodern:kilometer:notRepairedDamageno 1.733e-08
## collector_statusvintage:kilometer:notRepairedDamageno 5.900e-08
## t value
## (Intercept) -244.600
## gearboxautomatic:notRepairedDamageyes -24.940
## gearboxmanual:notRepairedDamageyes -24.930
## gearboxautomatic:notRepairedDamageno 5.660
## gearboxmanual:notRepairedDamageno NA
## vehicleTypepeople carrier:notRepairedDamageyes 4.357
## vehicleTypeconvertible:notRepairedDamageyes 10.928
## vehicleTypecoupe:notRepairedDamageyes 4.082
## vehicleTypesmall car:notRepairedDamageyes -3.139
## vehicleTypeestate:notRepairedDamageyes -5.647
## vehicleTypesedan:notRepairedDamageyes -0.940
## vehicleTypeSUV:notRepairedDamageyes 9.835
## vehicleTypepeople carrier:notRepairedDamageno 8.470
## vehicleTypeconvertible:notRepairedDamageno 20.549
## vehicleTypecoupe:notRepairedDamageno 7.327
## vehicleTypesmall car:notRepairedDamageno -2.782
## vehicleTypeestate:notRepairedDamageno -8.888
## vehicleTypesedan:notRepairedDamageno -5.577
## vehicleTypeSUV:notRepairedDamageno 13.099
## brand_cat.L:notRepairedDamageyes 3.798
## brand_cat.Q:notRepairedDamageyes -2.040
## brand_cat.C:notRepairedDamageyes -4.923
## brand_cat^4:notRepairedDamageyes -4.958
## brand_cat^5:notRepairedDamageyes 0.810
## brand_cat^6:notRepairedDamageyes -13.170
## brand_cat^7:notRepairedDamageyes -20.019
## brand_cat.L:notRepairedDamageno 11.577
## brand_cat.Q:notRepairedDamageno -5.559
## brand_cat.C:notRepairedDamageno -12.134
## brand_cat^4:notRepairedDamageno -16.240
## brand_cat^5:notRepairedDamageno 10.695
## brand_cat^6:notRepairedDamageno -21.441
## brand_cat^7:notRepairedDamageno -36.707
## powerPS:fuelTypeother:notRepairedDamageyes -0.743
## powerPS:fuelTypepetrol:notRepairedDamageyes 14.633
## powerPS:fuelTypecng:notRepairedDamageyes -0.864
## powerPS:fuelTypediesel:notRepairedDamageyes 9.586
## powerPS:fuelTypeelectric:notRepairedDamageyes 1.721
## powerPS:fuelTypehybrid:notRepairedDamageyes 0.329
## powerPS:fuelTypelpg:notRepairedDamageyes 6.326
## powerPS:fuelTypeother:notRepairedDamageno 0.524
## powerPS:fuelTypepetrol:notRepairedDamageno 40.024
## powerPS:fuelTypecng:notRepairedDamageno 2.282
## powerPS:fuelTypediesel:notRepairedDamageno 20.151
## powerPS:fuelTypeelectric:notRepairedDamageno -5.666
## powerPS:fuelTypehybrid:notRepairedDamageno -1.075
## powerPS:fuelTypelpg:notRepairedDamageno 8.032
## fuelTypeother:I(log(powerPS)):notRepairedDamageyes 4.466
## fuelTypepetrol:I(log(powerPS)):notRepairedDamageyes 9.918
## fuelTypecng:I(log(powerPS)):notRepairedDamageyes 4.904
## fuelTypediesel:I(log(powerPS)):notRepairedDamageyes 12.345
## fuelTypeelectric:I(log(powerPS)):notRepairedDamageyes -1.658
## fuelTypehybrid:I(log(powerPS)):notRepairedDamageyes 5.344
## fuelTypelpg:I(log(powerPS)):notRepairedDamageyes 9.646
## fuelTypeother:I(log(powerPS)):notRepairedDamageno 10.787
## fuelTypepetrol:I(log(powerPS)):notRepairedDamageno 53.686
## fuelTypecng:I(log(powerPS)):notRepairedDamageno 18.333
## fuelTypediesel:I(log(powerPS)):notRepairedDamageno 60.660
## fuelTypeelectric:I(log(powerPS)):notRepairedDamageno 16.979
## fuelTypehybrid:I(log(powerPS)):notRepairedDamageno 34.482
## fuelTypelpg:I(log(powerPS)):notRepairedDamageno 51.942
## yearOfRegistration:collector_statusmodern:notRepairedDamageyes 106.307
## yearOfRegistration:collector_statusvintage:notRepairedDamageyes 105.780
## yearOfRegistration:collector_statusmodern:notRepairedDamageno 252.585
## yearOfRegistration:collector_statusvintage:notRepairedDamageno 252.110
## collector_statusmodern:kilometer:notRepairedDamageyes -38.216
## collector_statusvintage:kilometer:notRepairedDamageyes -41.814
## collector_statusmodern:kilometer:notRepairedDamageno -164.767
## collector_statusvintage:kilometer:notRepairedDamageno -141.871
## Pr(>|t|)
## (Intercept) < 2e-16
## gearboxautomatic:notRepairedDamageyes < 2e-16
## gearboxmanual:notRepairedDamageyes < 2e-16
## gearboxautomatic:notRepairedDamageno 1.51e-08
## gearboxmanual:notRepairedDamageno NA
## vehicleTypepeople carrier:notRepairedDamageyes 1.32e-05
## vehicleTypeconvertible:notRepairedDamageyes < 2e-16
## vehicleTypecoupe:notRepairedDamageyes 4.46e-05
## vehicleTypesmall car:notRepairedDamageyes 0.001693
## vehicleTypeestate:notRepairedDamageyes 1.64e-08
## vehicleTypesedan:notRepairedDamageyes 0.347168
## vehicleTypeSUV:notRepairedDamageyes < 2e-16
## vehicleTypepeople carrier:notRepairedDamageno < 2e-16
## vehicleTypeconvertible:notRepairedDamageno < 2e-16
## vehicleTypecoupe:notRepairedDamageno 2.37e-13
## vehicleTypesmall car:notRepairedDamageno 0.005407
## vehicleTypeestate:notRepairedDamageno < 2e-16
## vehicleTypesedan:notRepairedDamageno 2.45e-08
## vehicleTypeSUV:notRepairedDamageno < 2e-16
## brand_cat.L:notRepairedDamageyes 0.000146
## brand_cat.Q:notRepairedDamageyes 0.041367
## brand_cat.C:notRepairedDamageyes 8.53e-07
## brand_cat^4:notRepairedDamageyes 7.13e-07
## brand_cat^5:notRepairedDamageyes 0.417773
## brand_cat^6:notRepairedDamageyes < 2e-16
## brand_cat^7:notRepairedDamageyes < 2e-16
## brand_cat.L:notRepairedDamageno < 2e-16
## brand_cat.Q:notRepairedDamageno 2.72e-08
## brand_cat.C:notRepairedDamageno < 2e-16
## brand_cat^4:notRepairedDamageno < 2e-16
## brand_cat^5:notRepairedDamageno < 2e-16
## brand_cat^6:notRepairedDamageno < 2e-16
## brand_cat^7:notRepairedDamageno < 2e-16
## powerPS:fuelTypeother:notRepairedDamageyes 0.457362
## powerPS:fuelTypepetrol:notRepairedDamageyes < 2e-16
## powerPS:fuelTypecng:notRepairedDamageyes 0.387477
## powerPS:fuelTypediesel:notRepairedDamageyes < 2e-16
## powerPS:fuelTypeelectric:notRepairedDamageyes 0.085257
## powerPS:fuelTypehybrid:notRepairedDamageyes 0.742448
## powerPS:fuelTypelpg:notRepairedDamageyes 2.52e-10
## powerPS:fuelTypeother:notRepairedDamageno 0.600297
## powerPS:fuelTypepetrol:notRepairedDamageno < 2e-16
## powerPS:fuelTypecng:notRepairedDamageno 0.022475
## powerPS:fuelTypediesel:notRepairedDamageno < 2e-16
## powerPS:fuelTypeelectric:notRepairedDamageno 1.47e-08
## powerPS:fuelTypehybrid:notRepairedDamageno 0.282448
## powerPS:fuelTypelpg:notRepairedDamageno 9.60e-16
## fuelTypeother:I(log(powerPS)):notRepairedDamageyes 7.98e-06
## fuelTypepetrol:I(log(powerPS)):notRepairedDamageyes < 2e-16
## fuelTypecng:I(log(powerPS)):notRepairedDamageyes 9.41e-07
## fuelTypediesel:I(log(powerPS)):notRepairedDamageyes < 2e-16
## fuelTypeelectric:I(log(powerPS)):notRepairedDamageyes 0.097307
## fuelTypehybrid:I(log(powerPS)):notRepairedDamageyes 9.09e-08
## fuelTypelpg:I(log(powerPS)):notRepairedDamageyes < 2e-16
## fuelTypeother:I(log(powerPS)):notRepairedDamageno < 2e-16
## fuelTypepetrol:I(log(powerPS)):notRepairedDamageno < 2e-16
## fuelTypecng:I(log(powerPS)):notRepairedDamageno < 2e-16
## fuelTypediesel:I(log(powerPS)):notRepairedDamageno < 2e-16
## fuelTypeelectric:I(log(powerPS)):notRepairedDamageno < 2e-16
## fuelTypehybrid:I(log(powerPS)):notRepairedDamageno < 2e-16
## fuelTypelpg:I(log(powerPS)):notRepairedDamageno < 2e-16
## yearOfRegistration:collector_statusmodern:notRepairedDamageyes < 2e-16
## yearOfRegistration:collector_statusvintage:notRepairedDamageyes < 2e-16
## yearOfRegistration:collector_statusmodern:notRepairedDamageno < 2e-16
## yearOfRegistration:collector_statusvintage:notRepairedDamageno < 2e-16
## collector_statusmodern:kilometer:notRepairedDamageyes < 2e-16
## collector_statusvintage:kilometer:notRepairedDamageyes < 2e-16
## collector_statusmodern:kilometer:notRepairedDamageno < 2e-16
## collector_statusvintage:kilometer:notRepairedDamageno < 2e-16
##
## (Intercept) ***
## gearboxautomatic:notRepairedDamageyes ***
## gearboxmanual:notRepairedDamageyes ***
## gearboxautomatic:notRepairedDamageno ***
## gearboxmanual:notRepairedDamageno
## vehicleTypepeople carrier:notRepairedDamageyes ***
## vehicleTypeconvertible:notRepairedDamageyes ***
## vehicleTypecoupe:notRepairedDamageyes ***
## vehicleTypesmall car:notRepairedDamageyes **
## vehicleTypeestate:notRepairedDamageyes ***
## vehicleTypesedan:notRepairedDamageyes
## vehicleTypeSUV:notRepairedDamageyes ***
## vehicleTypepeople carrier:notRepairedDamageno ***
## vehicleTypeconvertible:notRepairedDamageno ***
## vehicleTypecoupe:notRepairedDamageno ***
## vehicleTypesmall car:notRepairedDamageno **
## vehicleTypeestate:notRepairedDamageno ***
## vehicleTypesedan:notRepairedDamageno ***
## vehicleTypeSUV:notRepairedDamageno ***
## brand_cat.L:notRepairedDamageyes ***
## brand_cat.Q:notRepairedDamageyes *
## brand_cat.C:notRepairedDamageyes ***
## brand_cat^4:notRepairedDamageyes ***
## brand_cat^5:notRepairedDamageyes
## brand_cat^6:notRepairedDamageyes ***
## brand_cat^7:notRepairedDamageyes ***
## brand_cat.L:notRepairedDamageno ***
## brand_cat.Q:notRepairedDamageno ***
## brand_cat.C:notRepairedDamageno ***
## brand_cat^4:notRepairedDamageno ***
## brand_cat^5:notRepairedDamageno ***
## brand_cat^6:notRepairedDamageno ***
## brand_cat^7:notRepairedDamageno ***
## powerPS:fuelTypeother:notRepairedDamageyes
## powerPS:fuelTypepetrol:notRepairedDamageyes ***
## powerPS:fuelTypecng:notRepairedDamageyes
## powerPS:fuelTypediesel:notRepairedDamageyes ***
## powerPS:fuelTypeelectric:notRepairedDamageyes .
## powerPS:fuelTypehybrid:notRepairedDamageyes
## powerPS:fuelTypelpg:notRepairedDamageyes ***
## powerPS:fuelTypeother:notRepairedDamageno
## powerPS:fuelTypepetrol:notRepairedDamageno ***
## powerPS:fuelTypecng:notRepairedDamageno *
## powerPS:fuelTypediesel:notRepairedDamageno ***
## powerPS:fuelTypeelectric:notRepairedDamageno ***
## powerPS:fuelTypehybrid:notRepairedDamageno
## powerPS:fuelTypelpg:notRepairedDamageno ***
## fuelTypeother:I(log(powerPS)):notRepairedDamageyes ***
## fuelTypepetrol:I(log(powerPS)):notRepairedDamageyes ***
## fuelTypecng:I(log(powerPS)):notRepairedDamageyes ***
## fuelTypediesel:I(log(powerPS)):notRepairedDamageyes ***
## fuelTypeelectric:I(log(powerPS)):notRepairedDamageyes .
## fuelTypehybrid:I(log(powerPS)):notRepairedDamageyes ***
## fuelTypelpg:I(log(powerPS)):notRepairedDamageyes ***
## fuelTypeother:I(log(powerPS)):notRepairedDamageno ***
## fuelTypepetrol:I(log(powerPS)):notRepairedDamageno ***
## fuelTypecng:I(log(powerPS)):notRepairedDamageno ***
## fuelTypediesel:I(log(powerPS)):notRepairedDamageno ***
## fuelTypeelectric:I(log(powerPS)):notRepairedDamageno ***
## fuelTypehybrid:I(log(powerPS)):notRepairedDamageno ***
## fuelTypelpg:I(log(powerPS)):notRepairedDamageno ***
## yearOfRegistration:collector_statusmodern:notRepairedDamageyes ***
## yearOfRegistration:collector_statusvintage:notRepairedDamageyes ***
## yearOfRegistration:collector_statusmodern:notRepairedDamageno ***
## yearOfRegistration:collector_statusvintage:notRepairedDamageno ***
## collector_statusmodern:kilometer:notRepairedDamageyes ***
## collector_statusvintage:kilometer:notRepairedDamageyes ***
## collector_statusmodern:kilometer:notRepairedDamageno ***
## collector_statusvintage:kilometer:notRepairedDamageno ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2371 on 240703 degrees of freedom
## Multiple R-squared: 0.7649, Adjusted R-squared: 0.7649
## F-statistic: 1.169e+04 on 67 and 240703 DF, p-value: < 2.2e-16
So this regression achieves \(R^2=0.765\), which seems decent. There are many unsignificant parameters though, most of them for low-count fuel types such as electric or natural gas. To view these results more explicitly, let’s predict price on a few observations selected at random:
## pred_price price vehicleType yearOfRegistration gearbox powerPS
## 1: 5757.3994 6900 estate 2003 automatic 177
## 2: 3462.3774 4200 estate 2003 manual 143
## 3: 5393.7412 6900 sedan 2008 manual 143
## 4: 873.6259 1399 estate 2001 manual 90
## 5: 12265.8002 14500 estate 2013 manual 140
## 6: 2338.2414 3000 people carrier 1996 manual 102
## 7: 6668.3639 5100 estate 2006 automatic 185
## 8: 4891.4393 4950 estate 2008 manual 105
## 9: 2013.9319 2850 small car 2002 manual 90
## 10: 754.9463 590 small car 1994 manual 45
## 11: 4163.1543 5620 small car 2010 manual 69
## 12: 3223.3984 4200 sedan 2002 manual 150
## 13: 4730.7793 5299 sedan 2006 automatic 140
## 14: 10812.7010 7999 small car 2014 automatic 71
## 15: 3366.7457 1600 coupe 1993 manual 150
## 16: 677.4256 990 small car 1997 manual 54
## 17: 899.4355 400 sedan 1996 manual 60
## 18: 4230.1399 2900 people carrier 2005 manual 101
## 19: 7015.7365 10399 estate 2008 automatic 140
## 20: 33452.0384 27999 convertible 2013 automatic 184
## model kilometer monthOfRegistration fuelType brand
## 1: e_klasse 150000 8 diesel mercedes_benz
## 2: c_klasse 150000 2 petrol mercedes_benz
## 3: lancer 100000 10 lpg mitsubishi
## 4: focus 150000 5 diesel ford
## 5: astra 20000 11 petrol opel
## 6: transporter 150000 7 diesel volkswagen
## 7: v70 150000 10 diesel volvo
## 8: passat 150000 1 diesel volkswagen
## 9: one 150000 12 petrol mini
## 10: 1_reihe 150000 7 petrol peugeot
## 11: 500 70000 1 petrol fiat
## 12: passat 150000 3 petrol volkswagen
## 13: octavia 150000 2 diesel skoda
## 14: fortwo 30000 10 petrol smart
## 15: 3er 150000 4 petrol bmw
## 16: micra 150000 9 petrol nissan
## 17: golf 150000 1 petrol volkswagen
## 18: meriva 125000 6 diesel opel
## 19: a6 150000 10 diesel audi
## 20: z_reihe 50000 4 petrol bmw
## notRepairedDamage ad_up_time km_cat brand_cat log_price
## 1: no 1015.75347 150000 premium 3.838849
## 2: no 582.92222 150000 premium 3.623249
## 3: no 249.64861 100000 mid_minus 3.838849
## 4: yes 815.22153 150000 mid_minus 3.145818
## 5: no 119.52153 20000 mid_minus 4.161368
## 6: no 0.00000 150000 mid_plus 3.477121
## 7: no 216.12639 150000 premium_minus 3.707570
## 8: no 105.88333 150000 mid_plus 3.694605
## 9: no 0.00000 150000 premium_minus 3.454845
## 10: no 96.21597 150000 mid_range 2.770852
## 11: no 85.08819 70000 mid_minus 3.749736
## 12: no 1477.16597 150000 mid_plus 3.623249
## 13: no 651.24861 150000 budget_plus 3.724194
## 14: no 239.71250 30000 mid_plus 3.903036
## 15: no 230.05833 150000 premium 3.204120
## 16: no 200.43542 150000 mid_range 2.995635
## 17: no 349.80000 150000 mid_plus 2.602060
## 18: no 0.00000 125000 mid_minus 3.462398
## 19: no 1011.54375 150000 premium 4.016992
## 20: no 349.46319 50000 premium 4.447143
## collector_status
## 1: modern
## 2: modern
## 3: modern
## 4: modern
## 5: modern
## 6: modern
## 7: modern
## 8: modern
## 9: modern
## 10: vintage
## 11: modern
## 12: modern
## 13: modern
## 14: modern
## 15: vintage
## 16: modern
## 17: modern
## 18: modern
## 19: modern
## 20: modern
We see that some predictions are really close whereas others are widely off the mark. Let’s make this more visual:
The red line represents identity
The higher the price, the more the predictions seem to get wrong, which is intuitively logical seeing that we predicted log_price and not price directly. In log10 coordinates, this phenomenon disappears but the opposite appears: Errors seem larger for low prices:
The red line represents identity
We seem to generally underestimate prices a little. Let’s have a look at the residuals:
The residuals vs. fitted plot shows that there is still some pattern left in the data, especially toward the high-end of price predictions where we tend to under-estimate the outcome variable. The normal Q-Q curve shows a lot of departure from the ideal which is characteristic of heavy-tailed data such as we have here. The Scale-Location plot shows that the data is not completely homoscedastic. Finally, there does not seem to be any overly influencial observations according to the Residuals vs. Leverage plot, but we did a lot of cleaning up beforehand to get rid of many outliers. Looking at the three observations for which we are provided indices in the plot, we see that they are all alternative fuel vehicles:
## price vehicleType yearOfRegistration gearbox powerPS model kilometer
## 1: 1650 coupe 2000 automatic 193 clk 150000
## 2: 24850 sedan 2012 automatic 306 other 60000
## 3: 1642 coupe 2015 automatic 362 other 10000
## monthOfRegistration fuelType brand notRepairedDamage ad_up_time
## 1: 7 cng mercedes_benz yes 1553.35833
## 2: 8 hybrid bmw yes 94.83611
## 3: 3 electric bmw no 1496.37500
## km_cat brand_cat log_price collector_status
## 1: 150000 premium 3.217484 modern
## 2: 60000 premium 4.395326 modern
## 3: 10000 premium 3.215373 modern
Using a dataset made available on Kaggle and using data from the German eBay used car ads, I started by doing significant amounts of data cleaning, translations and conversions. I was then able to analyse each variable in turn and noticed that the price variable in particular had a very negatively skewed distribution and that a log transform was required. This initial phase of the analysis also allowed me to discover more issues with the data. Year of registration, price and mileage all had non-sensical entries, most probably due to human errors and maybe cheekiness. Most variables have large amounts of missing values. The brand and model variables tell us a lot about the German market, one of the most high-end market in Europe. Finally, I established that the ad up time was probably not usable as a predictor.
I was then able to look at associations between variables. I established that among the continuous variables, power had highest correlation to price, especially in log form, but with different profiles depending on fuel and transmission types. Looking at mileage was interesting because it exhibited two distinct correlations to price, one for “modern” cars and one for “vintage”. Mileage, another strong contributor, is unfortunately not really a continuous variable in this dataset, as it can only take a few distinct values.
Most categorical variables are also useful to the model. Fuel, gearbox and vehicle type all have associations with price. Unrepaired damages are associated to such a price drop that they litterally create a parallel market, which we must make sure that our linear model is able to capture as distinct from the bulk of the transactions.
Finally, the brand category variable that I manually created turned out to be quite useful as well as it exhibits a significant association to price.
Using the knowledge I had gained from this exploratory analysis, I built a linear model that explains 76.5% of the variance in the data. While this is not a very high level of accuracy, it is a very crude model that cannot account for all associations between variables. For instance, the price of vintage cars can vary differently depending on the brand of the vehicle. Since brand contains over 40 levels, I deliberately aggregated it into categories that try to reflect current market perception, not that of 25 or 30 years ago, so it would not necessarily be adequate as a predictor for vintage car prices. I also left model out of my set of predictors because it has over 200 levels, which is too complex for what I was trying to achieve here. Other potential predictors such as post code and free text description were also left out; while they would require a lot more work to provide sensible information, they could add significantly to the model.
Among other limitations, there is the high level of human error that this dataset seems to contain. To understand the source of some of those errors or to identify some new ones, it would be useful to be able to access the German eBay website as it was when data collection was performed. Similarly, I would like to better understand the criteria and process that were applied when scraping the website.
Finally, it would be interesting to compare this dataset to used car data from other sources to see if the model developped here would give us the same level of accuracy, or if eBay market prices are specific.
As a closing observation, I would point out that we often hear that domain-knowledge is critical to a good data analyst. I don’t know if it is a general rule, but in the present project, I found that an in-depth knowledge of the automotive market definitely helped me make more meaningful analyses, develop relevant theories and engineer useful variables.